
    Robustness of Random Forest-based gene selection methods

    Gene selection is an important part of microarray data analysis because it provides information that can lead to a better mechanistic understanding of an investigated phenomenon. At the same time, gene selection is very difficult because of the noisy nature of microarray data. As a consequence, gene selection is often performed with machine learning methods. The Random Forest method is particularly well suited for this purpose. In this work, four state-of-the-art Random Forest-based feature selection methods were compared in a gene selection context. The analysis focused on the stability of selection because, although it is necessary for determining the significance of results, it is often ignored in similar studies. The comparison of post-selection accuracy in the validation of Random Forest classifiers revealed that all investigated methods were equivalent in this context. However, the methods substantially differed with respect to the number of selected genes and the stability of selection. Of the analysed methods, the Boruta algorithm predicted the most genes as potentially important. The post-selection classifier error rate, a frequently used measure, was found to be potentially deceptive as a measure of gene selection quality. When the number of consistently selected genes was considered, the Boruta algorithm was clearly the best. Although it was also the most computationally intensive method, its computational demands could be reduced to levels comparable to those of the other algorithms by replacing the Random Forest importance with an analogous measure from Random Ferns (a similar but simplified classifier). Despite their design assumptions, the minimal optimal selection methods were found to select a high fraction of false positives.
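
    As an illustration of the Random Ferns substitution mentioned above, here is a minimal sketch, assuming the Boruta and rFerns R packages are installed (getImpFerns is the Ferns-based importance adapter shipped with the Boruta package; the synthetic data and parameters are arbitrary stand-ins, not the study's benchmark setup):

```r
library(Boruta)

set.seed(17)
# Toy stand-in for microarray-like data: 200 noise variables plus 5 informative ones.
x <- data.frame(matrix(rnorm(50 * 205), nrow = 50))
y <- factor(x[, 1] + x[, 2] + x[, 3] + x[, 4] + x[, 5] > 0)

# Selection driven by the default Random Forest importance.
selRf <- Boruta(x, y)

# The same selection driven by the cheaper Random Ferns importance
# (the getImpFerns adapter requires the rFerns package).
selFerns <- Boruta(x, y, getImp = getImpFerns)

getSelectedAttributes(selRf)
getSelectedAttributes(selFerns)
```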

    rFerns: An Implementation of the Random Ferns Method for General-Purpose Machine Learning

    In this paper I present an extended implementation of the Random Ferns algorithm, contained in the R package rFerns. It differs from the original in its ability to consume categorical and numerical attributes instead of only binary ones. Also, instead of a simple attribute-subspace ensemble, it employs bagging and thus produces an error approximation and a variable importance measure modelled after the Random Forest algorithm. I also present benchmark results which show that, although the accuracy of Random Ferns is usually lower than that achieved by Random Forest, its speed and the good quality of the importance measure it provides make rFerns a reasonable choice for specific applications.
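
    A minimal usage sketch under stated assumptions (the rFerns package is installed; the value of the importance argument and the oobErr field follow one version of the package interface and may differ in others):

```r
library(rFerns)

set.seed(7)
# Train a bagged ensemble of 1000 ferns of depth 5; iris is a stand-in
# dataset with numeric attributes, though factor columns are accepted too.
model <- rFerns(iris[, -5], iris$Species,
                depth = 5, ferns = 1000,
                importance = "simple")

model$oobErr      # out-of-bag error approximation provided by bagging
model$importance  # variable importance measure, modelled after Random Forest
```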

    Robust and efficient approach to feature selection with machine learning

    Most statistical analyses and modelling studies must deal with a discrepancy between the measured aspects of the analysed phenomena and their true nature. Hence, they are often preceded by a step that alters the data representation into a form better suited to the methods that follow. This thesis deals with feature selection, a narrow yet important subset of such representation-altering methodologies. Feature selection is applied to an information system, i.e., data in tabular form describing a group of objects characterised by the values of a set of attributes (also called features or variables), and is defined as the process of finding a strict subset of attributes that fulfils some criterion.

    There are two essential classes of feature selection methods: minimal optimal methods, which aim to find the smallest subset of features that optimises the accuracy of a certain modelling method, and all relevant methods, which aim to find the entire set of features potentially usable for modelling. The first class dominates in practice, as it reduces to a well-known optimisation problem and has a direct connection to final model performance. However, I argue that there exists a wide and significant class of applications in which only all relevant approaches can yield usable results, while minimal optimal methods are not merely ineffective but can even lead to wrong conclusions. Moreover, the all relevant class substantially overlaps with the set of actual research problems in which the feature selection is an important result in its own right, sometimes even more important than the resulting black-box model. In particular, this applies to p>>n problems, i.e., those in which the number of attributes is large and substantially exceeds the number of objects; such data are produced, for instance, by high-throughput biological experiments, which currently serve as the most powerful analytical tool of molecular biology and a foundation of emerging individualised medicine.

    In the main part of the thesis I present Boruta, a heuristic all relevant feature selection method. It is based on the concept of shadows: by-design irrelevant attributes, created by randomly permuting the values of the original features, which are incorporated into the information system as a reference for the relevance of the original features in the context of the whole structure of the analysed data. The variable importance itself is assessed with the Random Forest method, a popular ensemble classifier.

    As the performance of the Boruta method turns out to be unsatisfactory for some important applications, the following chapters of the thesis are devoted to Random Ferns, an ensemble classifier similar in structure to Random Forest but of substantially higher computational efficiency. I propose a substantial generalisation of this method, capable of training on generic data and of calculating feature importance scores.

    Finally, I assess both the Boruta method and its Random Ferns-based derivative on a series of p>>n problems of biological origin. In particular, I focus on the stability of feature selection, for which I propose a novel assessment methodology based on bootstrap and self-consistency. The results I obtain empirically confirm the aforementioned effects characteristic of minimal optimal selection, as well as the efficiency of the proposed heuristics for all relevant selection. The thesis is completed with a study of the applicability of Random Ferns in music information retrieval, showing the usefulness of this method in other contexts and proposing its generalisation to multi-label classification problems.
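
    To make the shadow concept concrete, below is a minimal illustrative sketch (a one-shot simplification using the randomForest package, not the Boruta package's actual iterative procedure): each original feature is paired with a shadow copy obtained by permuting its values, and a feature is kept as a relevance candidate only if its importance exceeds that of the best shadow.

```r
library(randomForest)

set.seed(42)
# 20 features, of which only the first two carry signal.
x <- data.frame(matrix(rnorm(100 * 20), nrow = 100))
y <- factor(x[, 1] - x[, 2] > 0)

# Shadows: per-column random permutations, irrelevant by construction.
shadows <- as.data.frame(lapply(x, sample))
names(shadows) <- paste0("shadow_", names(x))

rf  <- randomForest(cbind(x, shadows), y, importance = TRUE)
imp <- importance(rf, type = 1)  # mean decrease in accuracy

# A feature is a candidate only if it beats the best-scoring shadow.
maxShadow  <- max(imp[grep("^shadow_", rownames(imp)), ])
candidates <- rownames(imp)[imp[, 1] > maxShadow &
                            !grepl("^shadow_", rownames(imp))]
print(candidates)
```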

    Kendall transformation

    The Kendall transformation is a conversion of an ordered feature into a vector of pairwise order relations between individual values. This way, it preserves the ranking of observations and represents it in a categorical form. Such a transformation allows for the generalisation of methods requiring strictly categorical input, especially in the limit of a small number of observations, where discretisation becomes problematic. In particular, many approaches of information theory can be directly applied to Kendall-transformed continuous data without relying on differential entropy or any additional parameters. Moreover, by filtering the information down to that contained in the ranking, the Kendall transformation leads to better robustness at the reasonable cost of dropping sophisticated interactions, which are anyhow unlikely to be correctly estimated. In bivariate analysis, the Kendall transformation can be related to popular non-parametric methods, showing the soundness of the approach. The paper also demonstrates its efficiency in multivariate problems, and provides an example analysis of real-world data.
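
    As a minimal illustration of the transformation (hypothetical helper code, not the paper's reference implementation), the sketch below maps a numeric vector to the categorical vector of order relations over all ordered pairs of distinct positions:

```r
# Convert an ordered feature into pairwise order relations:
# for every ordered pair (i, j), i != j, record whether x[i] is
# below, tied with, or above x[j].
kendallTransform <- function(x) {
  pairs <- expand.grid(i = seq_along(x), j = seq_along(x))
  pairs <- pairs[pairs$i != pairs$j, ]
  factor(sign(x[pairs$i] - x[pairs$j]),
         levels = c(-1, 0, 1),
         labels = c("<", "=", ">"))
}

# Three values yield six ordered pairs; only ranking information survives,
# which is exactly what Kendall's tau would use in the bivariate case.
kendallTransform(c(2.5, 1.0, 3.7))
```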

    Feature Selection with the Boruta Package

    This article describes the R package Boruta, which implements a novel feature selection algorithm for finding all relevant variables. The algorithm is designed as a wrapper around a Random Forest classification algorithm. It iteratively removes the features which a statistical test shows to be less relevant than random probes. The Boruta package provides a convenient interface to the algorithm. A short description of the algorithm and examples of its application are presented.
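
    A minimal usage sketch of the package interface described above (iris serves as a convenient stand-in dataset; real applications typically involve much wider data):

```r
library(Boruta)

set.seed(1)
res <- Boruta(Species ~ ., data = iris)
print(res)                  # per-attribute decision: Confirmed / Tentative / Rejected
getSelectedAttributes(res)  # names of attributes confirmed as relevant
attStats(res)               # summary statistics of the importance history
```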
